{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### XGBoost 参数\n",
    "    \n",
    "XGBoost 分为三类参数\n",
    "    1. 通用参数：宏观函数控制\n",
    "    2. Booster参数：控制每一步的booster(tree/regression)\n",
    "    3. 学习目标函数：控制训练目标的表现\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 通用参数\n",
    "booster:每次迭代的模型：gbtree：基于树的模型；gbliner:线性模型\n"
   ]
  },
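  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch on synthetic toy data (names and values below are illustrative assumptions, not part of the original tutorial): either booster is selected the same way through the sklearn wrapper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch: both booster types are selected the same way (toy data, illustrative only)\n",
    "import numpy as np\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "X = np.random.rand(100, 5)            # toy features\n",
    "y = np.random.randint(2, size=100)    # toy binary labels\n",
    "\n",
    "for booster in ['gbtree', 'gblinear']:\n",
    "    clf = XGBClassifier(booster=booster, n_estimators=50)\n",
    "    clf.fit(X, y)\n",
    "    print(booster, clf.score(X, y))   # training accuracy, just to show both run"
   ]
  },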
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### booster 参数（树模型）\n",
    "\n",
    "    1. eta:通过减少每一步的权重，可以提高模型的鲁棒性，一般取值0.01-0.2\n",
    "    2. min_child_weight:决定最小叶子结点样本的权重和，这个参数可以避免过拟合，当它的值较大时，可以避免模型学习到局部的特殊样本，\n",
    "也不能过高，会导致欠拟合，这个参数需要使用CV来调整\n",
    "    3. max_depth:树的最大深度，可以用来避免过拟合，用CV调优，取值 3-10  \n",
    "    4. gamma:在节点分裂时，只有分裂后损失函数的值下降了，才会分裂这个节点，Gamma指定了节点分裂所需要的最小损失函数下降值。\n",
    "    5. subsample:这个参数控制每棵树，随机采样的比例  0.5-1   过小会导致欠拟合\n",
    "    6. colsample_bytree :控制每棵树随机采样的列数的占比\n",
    "    7. lambda alpha  正则化希数\n",
    "    \n",
    "        \n"
   ]
  },
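  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As referenced in the list above, a minimal sketch of tuning min_child_weight with xgb.cv on synthetic data (the data and candidate values are assumptions for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch: comparing min_child_weight values with 5-fold CV on toy data\n",
    "import numpy as np\n",
    "import xgboost as xgb\n",
    "\n",
    "X = np.random.rand(200, 8)            # toy data, illustrative only\n",
    "y = np.random.randint(2, size=200)\n",
    "dtrain = xgb.DMatrix(X, label=y)\n",
    "\n",
    "for mcw in [1, 3, 5]:\n",
    "    params = {'objective': 'binary:logistic', 'max_depth': 5,\n",
    "              'eta': 0.1, 'min_child_weight': mcw}\n",
    "    res = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics='auc')\n",
    "    print(mcw, res['test-auc-mean'].iloc[-1])   # CV AUC after the last round"
   ]
  },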
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 学习目标函数\n",
    "1. objective 最小化损失函数\n",
    "2. eval_metric: 有效数据的度量方法，回归问题 rmse ，分类为error\n",
    "    "
   ]
  },
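  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of setting objective and eval_metric, here for a regression task on synthetic data (all names and values are illustrative assumptions):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch: setting objective and eval_metric for a regression task\n",
    "import numpy as np\n",
    "import xgboost as xgb\n",
    "\n",
    "X = np.random.rand(100, 4)            # toy data, illustrative only\n",
    "y = np.random.rand(100)               # continuous target\n",
    "dtrain = xgb.DMatrix(X, label=y)\n",
    "\n",
    "params = {'objective': 'reg:squarederror',   # loss to minimize\n",
    "          'eval_metric': 'rmse'}             # metric reported on watched data\n",
    "bst = xgb.train(params, dtrain, num_boost_round=20,\n",
    "                evals=[(dtrain, 'train')])   # prints train-rmse each round"
   ]
  },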
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 调参实战：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Import libraries:\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import xgboost as xgb\n",
    "from xgboost.sklearn import XGBClassifier\n",
    "from sklearn import cross_validation, metrics   #Additional     scklearn functions\n",
    "from sklearn.grid_search import GridSearchCV   #Perforing grid search\n",
    "\n",
    "import matplotlib.pylab as plt\n",
    "%matplotlib inline\n",
    "from matplotlib.pylab import rcParams\n",
    "rcParams['figure.figsize'] = 12, 4\n",
    "\n",
    "train = pd.read_csv('train_modified.csv')\n",
    "target = 'Disbursed'\n",
    "IDcol = 'ID'\n",
    "\"\"\"\n",
    "\n",
    "注意我import了两种XGBoost：\n",
    "xgb - 直接引用xgboost。接下来会用到其中的“cv”函数。\n",
    "XGBClassifier - 是xgboost的sklearn包。这个包允许我们像GBM一样使用Grid Search 和并行处理。\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def modelfit(alg, dtrain, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):\n",
    "    if useTrainCV:\n",
    "        xgb_param = alg.get_xgb_params()\n",
    "        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)\n",
    "        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,\n",
    "            metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False)\n",
    "        alg.set_params(n_estimators=cvresult.shape[0])\n",
    "\n",
    "    #Fit the algorithm on the data\n",
    "    alg.fit(dtrain[predictors], dtrain['Disbursed'],eval_metric='auc')\n",
    "\n",
    "    #Predict training set:\n",
    "    dtrain_predictions = alg.predict(dtrain[predictors])\n",
    "    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]\n",
    "\n",
    "    #Print model report:\n",
    "    print \"\\nModel Report\"\n",
    "    print \"Accuracy : %.4g\" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)\n",
    "    print \"AUC Score (Train): %f\" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)\n",
    "\n",
    "    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)\n",
    "    feat_imp.plot(kind='bar', title='Feature Importances')\n",
    "    plt.ylabel('Feature Importance Score')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第一步：确定学习速率和tree_based 参数调优的估计器数目"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Choose all predictors except target & IDcols\n",
    "predictors = [x for x in train.columns if x not in [target,IDcol]]\n",
    "xgb1 = XGBClassifier(\n",
    " learning_rate =0.1,\n",
    " n_estimators=1000,    #n_estimators: 树的个数。并非越多越好，通常是50到1000之间。\n",
    " max_depth=5,\n",
    " min_child_weight=1,\n",
    " gamma=0,\n",
    " subsample=0.8,\n",
    " colsample_bytree=0.8,\n",
    " objective= 'binary:logistic',\n",
    " nthread=4,\n",
    " scale_pos_weight=1,\n",
    " seed=27)\n",
    "modelfit(xgb1, train, predictors)\n",
    "\n",
    "#在学习率指定的情况下，理想决策数数目的个数     "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第二步：max_depth 和min_weight 参数调优\n",
    "\n",
    "我们先对这两个参数调优，是因为它们对最终结果有很大的影响。首先，我们先大范围地粗调参数，然后再小范围地微调。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param_test1 = {\n",
    " 'max_depth':range(3,10,2),\n",
    " 'min_child_weight':range(1,6,2)\n",
    "}\n",
    "gsearch1 = GridSearchCV(estimator = XGBClassifier(         learning_rate =0.1, n_estimators=140, max_depth=5,\n",
    "min_child_weight=1, gamma=0, subsample=0.8,             colsample_bytree=0.8,\n",
    " objective= 'binary:logistic', nthread=4,     scale_pos_weight=1, seed=27), \n",
    " param_grid = param_test1,     scoring='roc_auc',n_jobs=4,iid=False, cv=5)\n",
    "gsearch1.fit(train[predictors],train[target])\n",
    "gsearch1.grid_scores_, gsearch1.best_params_,     gsearch1.best_score_\n"
   ]
  },
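  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A fine-tuning sketch: assuming the coarse search above settles near max_depth=5 and min_child_weight=5 (illustrative values), search one step to either side. This reuses train, predictors, and target from the cells above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fine-grained search around the coarse optimum (center values are assumptions)\n",
    "param_test2 = {\n",
    " 'max_depth': [4, 5, 6],\n",
    " 'min_child_weight': [4, 5, 6]\n",
    "}\n",
    "gsearch2 = GridSearchCV(\n",
    "    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,\n",
    "                            min_child_weight=5, gamma=0, subsample=0.8, colsample_bytree=0.8,\n",
    "                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),\n",
    "    param_grid=param_test2, scoring='roc_auc', n_jobs=4, cv=5)\n",
    "gsearch2.fit(train[predictors], train[target])\n",
    "gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_"
   ]
  },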
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第三步：gamma 参数调优"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param_test3 = {\n",
    " 'gamma':[i/10.0 for i in range(0,5)]\n",
    "}\n",
    "gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4, min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)\n",
    "\n",
    "gsearch3.fit(train[predictors],train[target])\n",
    "gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第四步： subsample 和 colsample_bytree参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param_test4 = {\n",
    " 'subsample':[i/10.0 for i in range(6,10)],\n",
    " 'colsample_bytree':[i/10.0 for i in range(6,10)]\n",
    "}\n",
    "\n",
    "gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=3, min_child_weight=4, gamma=0.1, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)\n",
    "\n",
    "gsearch4.fit(train[predictors],train[target])\n",
    "gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_\n"
   ]
  },
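  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, refine subsample and colsample_bytree in 0.05 steps around the best values found above (the 0.75-0.85 window below is an assumption; center it on your own best result):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# finer search in 0.05 steps (window centered on assumed best values of 0.8)\n",
    "param_test5 = {\n",
    " 'subsample': [i/100.0 for i in range(75, 90, 5)],\n",
    " 'colsample_bytree': [i/100.0 for i in range(75, 90, 5)]\n",
    "}\n",
    "gsearch5 = GridSearchCV(\n",
    "    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=3,\n",
    "                            min_child_weight=4, gamma=0.1, subsample=0.8, colsample_bytree=0.8,\n",
    "                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),\n",
    "    param_grid=param_test5, scoring='roc_auc', n_jobs=4, cv=5)\n",
    "gsearch5.fit(train[predictors], train[target])\n",
    "gsearch5.cv_results_, gsearch5.best_params_, gsearch5.best_score_"
   ]
  },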
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第五步：正则化参数防止过拟合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param_test6 = {\n",
    " 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]\n",
    "}\n",
    "gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4, min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)\n",
    "\n",
    "gsearch6.fit(train[predictors],train[target])\n",
    "gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_\n"
   ]
  },
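  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the coarse reg_alpha sweep has fixed the order of magnitude, a finer sweep around the best value can follow (the candidate list below assumes the optimum landed near 0.01):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# finer reg_alpha sweep (candidates are assumptions around a presumed optimum of 0.01)\n",
    "param_test7 = {\n",
    " 'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]\n",
    "}\n",
    "gsearch7 = GridSearchCV(\n",
    "    estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,\n",
    "                            min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,\n",
    "                            objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),\n",
    "    param_grid=param_test7, scoring='roc_auc', n_jobs=4, cv=5)\n",
    "gsearch7.fit(train[predictors], train[target])\n",
    "gsearch7.cv_results_, gsearch7.best_params_, gsearch7.best_score_"
   ]
  },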
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第六步：降低学习率\n",
    "我们使用较低的学习速率，以及使用更多的决策树"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb4 = XGBClassifier(\n",
    " learning_rate =0.01,\n",
    " n_estimators=5000,\n",
    " max_depth=4,\n",
    " min_child_weight=6,\n",
    " gamma=0,\n",
    " subsample=0.8,\n",
    " colsample_bytree=0.8,\n",
    " reg_alpha=0.005,\n",
    " objective= 'binary:logistic',\n",
    " nthread=4,\n",
    " scale_pos_weight=1,\n",
    " seed=27)\n",
    "modelfit(xgb4, train, predictors)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## XGBoost 详解"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据格式：\n",
    "1. libsvm 格式的文本数据；\n",
    "2. Numpy 的二维数组；\n",
    "3. XGBoost 的二进制的缓存文件。加载的数据存储在对象 DMatrix 中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#加载libsvm格式的数据\n",
    "dtrain1 = xgb.DMatrix('train.svm.txt')\n",
    "#加载numpy的数组\n",
    "data = np.random.rand(5,10) # 5 entities, each contains 10 features\n",
    "label = np.random.randint(2, size=5) # binary target\n",
    "dtrain = xgb.DMatrix( data, label=label)"
   ]
  },
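  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the third format listed above, the binary cache: it can be produced from an existing DMatrix, and reloading it is much faster than re-parsing text (the file name is illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save the DMatrix to XGBoost's binary format, then load it back\n",
    "dtrain.save_binary('train.buffer')      # 'train.buffer' is an illustrative name\n",
    "dtrain2 = xgb.DMatrix('train.buffer')   # loading the binary cache is fast"
   ]
  },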
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 参数设置\n",
    "XGBoost使用key-value字典的方式存储参数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = {\n",
    "    'booster': 'gbtree',\n",
    "    'objective': 'multi:softmax',  # 多分类的问题\n",
    "    'num_class': 10,               # 类别数，与 multisoftmax 并用\n",
    "    'gamma': 0.1,                  # 用于控制是否后剪枝的参数,越大越保守，一般0.1、0.2这样子。\n",
    "    'max_depth': 12,               # 构建树的深度，越大越容易过拟合\n",
    "    'lambda': 2,                   # 控制模型复杂度的权重值的L2正则化项参数，参数越大，模型越不容易过拟合。\n",
    "    'subsample': 0.7,              # 随机采样训练样本\n",
    "    'colsample_bytree': 0.7,       # 生成树时进行的列采样\n",
    "    'min_child_weight': 3,\n",
    "    'silent': 1,                   # 设置成1则没有运行信息输出，最好是设置为0.\n",
    "    'eta': 0.007,                  # 如同学习率\n",
    "    'seed': 1000,\n",
    "    'nthread': 4,                  # cpu 线程数\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 训练模型\n",
    "有了参数列表和数据就可以训练模型了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_round = 10\n",
    "bst = xgb.train( plst, dtrain, num_round, evallist )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 模型预测"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# X_test类型可以是二维List，也可以是numpy的数组\n",
    "dtest = DMatrix(X_test)\n",
    "ans = model.predict(dtest)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ### 保存模型和加载\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bst.save_model('test.model')\n",
    "#你可以导出模型到txt文件并浏览模型的含义：\n",
    "# dump model\n",
    "bst.dump_model('dump.raw.txt')\n",
    "# dump model with feature map\n",
    "bst.dump_model('dump.raw.txt','featmap.txt')\n",
    "#加载模型\n",
    "bst = xgb.Booster({'nthread':4}) # init model\n",
    "bst.load_model(\"model.bin\")      # load data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## XGBoost实战\n",
    "    XGBoost有两大类接口：XGBoost原生接口 和 scikit-learn接口 ，并且XGBoost能够实现 分类 和 回归 两种任务。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于XGBoost原生接口的分类\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "import xgboost as xgb\n",
    "from xgboost import plot_importance\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# read in the iris data\n",
    "iris = load_iris()\n",
    "\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)\n",
    "\n",
    "params = {\n",
    "    'booster': 'gbtree',\n",
    "    'objective': 'multi:softmax',\n",
    "    'num_class': 3,\n",
    "    'gamma': 0.1,\n",
    "    'max_depth': 6,\n",
    "    'lambda': 2,\n",
    "    'subsample': 0.7,\n",
    "    'colsample_bytree': 0.7,\n",
    "    'min_child_weight': 3,\n",
    "    'silent': 1,\n",
    "    'eta': 0.1,\n",
    "    'seed': 1000,\n",
    "    'nthread': 4,\n",
    "}\n",
    "\n",
    "plst = params.items()\n",
    "\n",
    "\n",
    "dtrain = xgb.DMatrix(X_train, y_train)\n",
    "num_rounds = 500\n",
    "model = xgb.train(plst, dtrain, num_rounds)\n",
    "\n",
    "# 对测试集进行预测\n",
    "dtest = xgb.DMatrix(X_test)\n",
    "ans = model.predict(dtest)\n",
    "\n",
    "# 计算准确率\n",
    "cnt1 = 0\n",
    "cnt2 = 0\n",
    "for i in range(len(y_test)):\n",
    "    if ans[i] == y_test[i]:\n",
    "        cnt1 += 1\n",
    "    else:\n",
    "        cnt2 += 1\n",
    "\n",
    "print(\"Accuracy: %.2f %% \" % (100 * cnt1 / (cnt1 + cnt2)))\n",
    "\n",
    "# 显示重要特征\n",
    "plot_importance(model)\n",
    "plt.show()"
   ]
  },
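  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With objective set to multi:softmax, predict returns hard class labels. As a hypothetical variant reusing params, dtrain, and dtest from the cell above, multi:softprob makes the same model return per-class probabilities:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# hypothetical variant: multi:softprob yields per-class probabilities\n",
    "params_prob = dict(params, objective='multi:softprob')\n",
    "model_prob = xgb.train(list(params_prob.items()), dtrain, 50)\n",
    "prob = model_prob.predict(dtest).reshape(-1, params_prob['num_class'])\n",
    "pred = prob.argmax(axis=1)              # recover hard class labels"
   ]
  },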
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于XGBoost原生接口的回归"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost as xgb\n",
    "from xgboost import plot_importance\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# 读取文件原始数据\n",
    "data = []\n",
    "labels = []\n",
    "labels2 = []\n",
    "with open(\"lppz5.csv\", encoding='UTF-8') as fileObject:\n",
    "    for line in fileObject:\n",
    "        line_split = line.split(',')\n",
    "        data.append(line_split[10:])\n",
    "        labels.append(line_split[8])\n",
    "\n",
    "X = []\n",
    "for row in data:\n",
    "    row = [float(x) for x in row]\n",
    "    X.append(row)\n",
    "\n",
    "y = [float(x) for x in labels]\n",
    "\n",
    "# XGBoost训练过程\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
    "\n",
    "params = {\n",
    "    'booster': 'gbtree',\n",
    "    'objective': 'reg:gamma',\n",
    "    'gamma': 0.1,\n",
    "    'max_depth': 5,\n",
    "    'lambda': 3,\n",
    "    'subsample': 0.7,\n",
    "    'colsample_bytree': 0.7,\n",
    "    'min_child_weight': 3,\n",
    "    'silent': 1,\n",
    "    'eta': 0.1,\n",
    "    'seed': 1000,\n",
    "    'nthread': 4,\n",
    "}\n",
    "\n",
    "dtrain = xgb.DMatrix(X_train, y_train)\n",
    "num_rounds = 300\n",
    "plst = params.items()\n",
    "model = xgb.train(plst, dtrain, num_rounds)\n",
    "\n",
    "# 对测试集进行预测\n",
    "dtest = xgb.DMatrix(X_test)\n",
    "ans = model.predict(dtest)\n",
    "\n",
    "# 显示重要特征\n",
    "plot_importance(model)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于Scikit-learn接口的分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "import xgboost as xgb\n",
    "from xgboost import plot_importance\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# read in the iris data\n",
    "iris = load_iris()\n",
    "\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
    "\n",
    "# 训练模型\n",
    "model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='multi:softmax')\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# 对测试集进行预测\n",
    "ans = model.predict(X_test)\n",
    "\n",
    "# 计算准确率\n",
    "cnt1 = 0\n",
    "cnt2 = 0\n",
    "for i in range(len(y_test)):\n",
    "    if ans[i] == y_test[i]:\n",
    "        cnt1 += 1\n",
    "    else:\n",
    "        cnt2 += 1\n",
    "\n",
    "print(\"Accuracy: %.2f %% \" % (100 * cnt1 / (cnt1 + cnt2)))\n",
    "\n",
    "# 显示重要特征\n",
    "plot_importance(model)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于Scikit-learn接口的回归"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost as xgb\n",
    "from xgboost import plot_importance\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# 读取文件原始数据\n",
    "data = []\n",
    "labels = []\n",
    "labels2 = []\n",
    "with open(\"lppz5.csv\", encoding='UTF-8') as fileObject:\n",
    "    for line in fileObject:\n",
    "        line_split = line.split(',')\n",
    "        data.append(line_split[10:])\n",
    "        labels.append(line_split[8])\n",
    "\n",
    "X = []\n",
    "for row in data:\n",
    "    row = [float(x) for x in row]\n",
    "    X.append(row)\n",
    "\n",
    "y = [float(x) for x in labels]\n",
    "\n",
    "# XGBoost训练过程\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
    "\n",
    "model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='reg:gamma')\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# 对测试集进行预测\n",
    "ans = model.predict(X_test)\n",
    "\n",
    "# 显示重要特征\n",
    "plot_importance(model)\n",
    "plt.show() "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
